American Eating & Health Analysis

Keri Wheatley

January 31, 2017











Introduction

The Eating & Health (EH) Module of the American Time Use Survey (ATUS) is sponsored by the U.S. Department of Agriculture’s Economic Research Service. The ATUS itself is sponsored by the Bureau of Labor Statistics and conducted by the U.S. Census Bureau. This document describes the variables available in the two EH Module data files: the EH Respondent file and the EH Activity file. The files are currently available for 2014 and contain information gathered through the 2014 ATUS interviews; all EH Module questions were asked at the end of the ATUS interview. The supplement surveys individuals aged 15 and up from a nationally representative sample of approximately 2,100 households each month. The main variables of interest are:

       1. primary eating

       2. secondary eating (eating while performing an activity)

       3. grocery shopping

       4. meal preparation

       5. food assistance participation

       6. general health, height, and weight

       7. exercise

       8. household income











Question

       Can we find patterns between health, habits, and income?

       Can we use these patterns to predict health?











Exploratory Analysis











Data Availability

The dataset contains 11212 samples. Data availability for each variable is shown below.
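A minimal sketch of how per-variable availability could be tallied. It assumes missing answers are stored as negative codes (-1 blank, -2 don't know, -3 refused), which is an assumption about the file layout rather than something stated on these slides. The helper name `availability` and the toy records are hypothetical; EUWGT and EUHGT are the weight and height variables.

```python
# Codes assumed to mark a missing answer (blank, don't know, refused).
MISSING_CODES = {-1, -2, -3}

def availability(records):
    """Count, per variable, how many records hold a usable (non-missing) value."""
    counts = {}
    for record in records:
        for name, value in record.items():
            counts.setdefault(name, 0)
            if value is not None and value not in MISSING_CODES:
                counts[name] += 1
    return counts

# Hypothetical toy records, not survey data.
toy_records = [
    {"EUWGT": 170, "EUHGT": 66},
    {"EUWGT": -2, "EUHGT": 70},
    {"EUWGT": 145, "EUHGT": -3},
]
toy_counts = availability(toy_records)
print(toy_counts)
```

Dividing each count by the total number of samples gives the share of usable answers per variable.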











Correlation Matrix











Positive Correlation

Top 20 positively correlated variables











Negative Correlation

Top 20 negatively correlated variables
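The two rankings above could be produced along these lines: compute the Pearson correlation for every pair of variables, sort, and keep the top entries. This is a sketch on hypothetical toy columns, not the code behind the slides; the names `pearson` and `ranked_pairs` are illustrative.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ranked_pairs(columns, top=20, positive=True):
    """Return the `top` variable pairs, most positive (or most negative) first."""
    pairs = [(a, b, pearson(columns[a], columns[b]))
             for a, b in combinations(columns, 2)]
    pairs.sort(key=lambda item: item[2], reverse=positive)
    return pairs[:top]

# Hypothetical toy columns, not survey data.
toy_columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
most_positive = ranked_pairs(toy_columns, top=1, positive=True)
most_negative = ranked_pairs(toy_columns, top=1, positive=False)
```

With real survey columns, rows with a missing value in either variable would need to be dropped pair by pair before calling `pearson`.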











One-Dimensional Analysis











BMI

Sample Count: 10637

Calculation: weight (kg) / [height (m)]^2

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   23.60   26.60   27.77   30.70   73.60
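Because the module records weight in pounds and height in inches, the formula needs a unit conversion first. A short sketch, using the exact definitions of the pound and the inch; the function name `bmi` is illustrative.

```python
LB_TO_KG = 0.45359237  # kilograms per pound (exact by definition)
IN_TO_M = 0.0254       # metres per inch (exact by definition)

def bmi(weight_lb, height_in):
    """Body mass index from weight in pounds and height in inches."""
    weight_kg = weight_lb * LB_TO_KG
    height_m = height_in * IN_TO_M
    return weight_kg / height_m ** 2

# A 170 lb respondent who is 66 inches tall.
example_bmi = bmi(170, 66)
print(round(example_bmi, 1))  # 27.4
```

Since weight and height are bottom- and top-coded, BMI values computed at those limits are only approximate.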










Weight

Sample Count: 10712

Self-Reported Response: How much do you weigh without shoes? (in pounds)

Note: EUWGT is bottomcoded to 98 lbs and topcoded to 340 lbs.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    98.0   145.0   170.0   176.2   200.0   340.0










Height

Sample Count: 11051

Self-Reported Response: How tall are you without shoes? (in inches)

Note: EUHGT is bottomcoded to 56 inches and topcoded to 77 inches.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   56.00   64.00   66.00   66.62   70.00   77.00










Health

Sample Count: 11128

Self-Reported Response: In general, would you say that your physical health was excellent, very good, good, fair, or poor?
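Later slides use this answer in two forms: as indicator variables in the BMI regressions and as a single numeric score in the health regressions and the clustering. A sketch of both encodings follows. The direction of the numeric scale (1 = excellent, 5 = poor) is an assumption, and the function names are hypothetical.

```python
HEALTH_LEVELS = ["excellent", "very good", "good", "fair", "poor"]

def health_to_numeric(answer):
    """Ordinal score: 1 = excellent ... 5 = poor (direction is an assumption)."""
    return HEALTH_LEVELS.index(answer) + 1

def health_to_dummies(answer, baseline="poor"):
    """One 0/1 indicator per level, omitting the baseline level."""
    return {level: int(answer == level)
            for level in HEALTH_LEVELS if level != baseline}

print(health_to_numeric("good"))   # 3
print(health_to_dummies("fair"))
```

Omitting one level avoids perfect multicollinearity with the intercept; each remaining coefficient is then read relative to the omitted level.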











Exercise

Self-Reported Response: During the past 7 days, did you participate in any physical activities or exercises for fitness and health such as running, bicycling, working out in a gym, walking for exercise, or playing sports? (Sample Count: 11155)

Self-Reported Response: How many times over the past 7 days did you take part in these activities? (Sample Count: 6993)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   3.000   4.000   4.193   5.000  38.000










Enough Food

Self-Reported Response: Which of the following statements best describes the amount of food eaten in your household in the last 30 days - enough food to eat, sometimes not enough to eat, or often not enough to eat? (Sample Count: 11161)











Fast Food

Self-Reported Response: Thinking back over the last 7 days, did you purchase any: prepared food from a deli, carry-out, delivery food, or fast food? (Sample Count: 11169)

Self-Reported Response: How many times in the last 7 days did you purchase: prepared food from a deli, carry-out, delivery food, or fast food? (Sample Count: 6440)











Grocer

Self-Reported Response: Where do you get the majority of your groceries? (Sample Count: 8208)

Self-Reported Response: What is the primary reason you shop there? (Sample Count: 8131)











Drink

Self-Reported Response: Not including plain water, were there any other times yesterday when you were drinking any beverages? (Sample Count: 11202)

Self-Reported Response: Were any of the beverages soft drinks such as cola, root beer, or ginger ale? (Sample Count: 7513)

Self-Reported Response: Was the soft drink diet, regular or did you have both kinds? (Sample Count: 3037)











Primary Eating

Sample Count: 10725

Self-Reported Response: Total amount of time spent in primary eating and drinking (in minutes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   30.00   60.00   68.66   90.00  508.00










Secondary Eating

Sample Count: 6061

Self-Reported Response: Total amount of time spent in secondary eating and drinking (in minutes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   10.00   15.00   31.02   30.00  990.00










Income

Sample Count: 10932

Self-Reported Response: Last month, was your total household income before taxes more or less than (amount) per month?











SNAP

Sample Count: 11135

Self-Reported Response: In the past 30 days, did you or any member of this household receive SNAP or food stamp benefits?











WIC

Sample Count: 5805

Self-Reported Response: In the last 30 days, did you or any member of your household receive benefits from the WIC program, that is, the Women, Infants, and Children program?











Employment

Sample Count: 5677

Self-Reported Response: Change in spouse or unmarried partner’s labor force status or full time or part time employment status between CPS and ATUS











BMI versus…











Weight











Height











Health











Exercise

The independent samples t-test below shows with high confidence that people who exercise have, on average, a lower BMI than people who don’t exercise.

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -12.917, df = 7218.6, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.904665 -1.402735
## sample estimates:
## mean of x mean of y 
##  27.15812  28.81182
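For reference, the Welch statistic and its degrees of freedom can be computed by hand as sketched below. The toy samples are hypothetical; only the last line uses numbers from the output above, recovering the standard error of the difference from the reported means and t value.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx = variance(x) / nx  # squared standard error of mean(x)
    vy = variance(y) / ny  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

# Hypothetical toy samples, not survey data.
t_toy, df_toy = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Standard error implied by the reported output: difference in means / t.
reported_se = (27.15812 - 28.81182) / -12.917
```

The implied standard error of about 0.128 BMI points is consistent with the reported confidence interval, whose half-width is roughly 1.96 times that value.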

We expect diminishing returns from exercise: as frequency increases, each additional session should lower BMI by less. A log transformation of exercise frequency captures this shape.
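A small sketch of why the log captures diminishing returns. It assumes the transform is the natural log of (frequency + 1), so that people who did not exercise map to 0; the slides do not state the exact form, and the function name is hypothetical.

```python
import math

def log_exercise_freq(times_per_week):
    """Natural log of (frequency + 1); the +1 is an assumption so that 0 maps to 0."""
    return math.log1p(times_per_week)

# Increase in the transformed value for each extra weekly session (0->1, 1->2, ...).
steps = [log_exercise_freq(k + 1) - log_exercise_freq(k) for k in range(5)]
print([round(s, 3) for s in steps])  # [0.693, 0.405, 0.288, 0.223, 0.182]
```

Each extra session moves the transformed variable by less than the previous one, so a constant regression coefficient on the log translates into a shrinking effect per session.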











Enough Food











Fast Food











Grocer











Drink











Primary Eating











Secondary Eating











Total Eating











Income











SNAP

The independent samples t-test below shows with high confidence that people in households receiving SNAP benefits have, on average, a higher BMI than people in households that do not.

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 8.3221, df = 1261.2, p-value = 2.221e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.469371 2.375842
## sample estimates:
## mean of x mean of y 
##  29.49899  27.57639










WIC

The independent samples t-test below shows with high confidence that people in households receiving WIC benefits have, on average, a higher BMI than people in households that do not.

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 4.57, df = 397.73, p-value = 6.519e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9762109 2.4502044
## sample estimates:
## mean of x mean of y 
##  29.16994  27.45674










Employment











Grocer versus…











SNAP











WIC











Multiple Linear Regression











Predicting BMI

Linear Models Predicting BMI
Dependent variable: BMI

                      (1)               (2)               (3)               (4)               (5)
Log_Exercise_Freq     -1.125***         -1.080***                           -0.512***         -0.508***
                      (0.075)           (0.250)                             (0.074)           (0.074)
Sq_Log_Exercise_Freq                    -0.023
                                        (0.121)
Excellent_Health                                          -5.767***         -5.347***         -5.160***
                                                          (0.397)           (0.403)           (0.405)
Very_Good_Health                                          -4.052***         -3.747***         -3.564***
                                                          (0.395)           (0.399)           (0.401)
Good_Health                                               -1.580***         -1.392***         -1.246**
                                                          (0.402)           (0.404)           (0.405)
Fair_Health                                               0.108             0.199             0.266
                                                          (0.438)           (0.439)           (0.439)
SNAP                                                                                          -0.733**
                                                                                              (0.234)
Constant              28.893***         28.887***         30.716***         30.966***         32.203***
                      (0.102)           (0.109)           (0.385)           (0.388)           (0.568)
Observations          10,221            10,221            10,221            10,221            10,221
R^2                   0.023             0.023             0.104             0.109             0.110
Adjusted R^2          0.022             0.022             0.104             0.108             0.109
Residual Std. Error   6.104             6.104             5.844             5.830             5.826
                      (df = 10219)      (df = 10218)      (df = 10216)      (df = 10215)      (df = 10214)
F Statistic           236.038***        118.029***        297.647***        249.183***        210.261***
                      (df = 1; 10219)   (df = 2; 10218)   (df = 4; 10216)   (df = 5; 10215)   (df = 6; 10214)
Note: *p<0.05; **p<0.01; ***p<0.001
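To make the coefficients concrete, here is a sketch that turns Model (4) into a fitted BMI. Two things are assumptions: that Log_Exercise_Freq is the natural log of (weekly frequency + 1), and that "poor" is the omitted health category (it is the only one of the five answers without a dummy). The function name is hypothetical.

```python
import math

# Coefficients of Model (4) as reported in the table above.
MODEL_4 = {
    "Constant": 30.966,
    "Log_Exercise_Freq": -0.512,
    "Excellent_Health": -5.347,
    "Very_Good_Health": -3.747,
    "Good_Health": -1.392,
    "Fair_Health": 0.199,
}

def predict_bmi(exercise_times, health=None):
    """Fitted BMI; `health` is a dummy name from MODEL_4, or None for the baseline."""
    log_freq = math.log1p(exercise_times)  # assumed form of the transform
    fitted = MODEL_4["Constant"] + MODEL_4["Log_Exercise_Freq"] * log_freq
    if health is not None:
        fitted += MODEL_4[health]
    return fitted

baseline = predict_bmi(0)                      # poor health, no exercise
active = predict_bmi(3, "Excellent_Health")    # excellent health, 3 sessions a week
print(round(baseline, 2), round(active, 2))    # 30.97 24.91
```

Most of the gap between the two fitted values comes from the self-reported health dummy rather than from exercise frequency, which matches the jump in R^2 between Models (1) and (3).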

 

 

AIC (1) = 6.599 × 10^4; AIC (2) = 6.599 × 10^4; AIC (3) = 6.51 × 10^4; AIC (4) = 6.505 × 10^4; AIC (5) = 6.504 × 10^4
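These AIC values can be cross-checked against the regression table. For a Gaussian linear model, AIC = 2k - 2 log L, where log L is evaluated at the maximum-likelihood variance RSS/n and k counts the coefficients plus the error variance. The sketch below rebuilds the AIC from the rounded residual standard error, so it only matches to about four significant digits; the function name is hypothetical.

```python
import math

def aic_from_rse(n_obs, resid_df, rse):
    """AIC of a Gaussian linear model from its residual standard error."""
    rss = rse ** 2 * resid_df            # residual sum of squares
    n_coef = n_obs - resid_df            # coefficients, including the intercept
    k = n_coef + 1                       # plus one for the error variance
    log_lik = -0.5 * n_obs * (math.log(2 * math.pi) + math.log(rss / n_obs) + 1)
    return 2 * k - 2 * log_lik

aic_model_1 = aic_from_rse(10221, 10219, 6.104)
aic_model_4 = aic_from_rse(10221, 10215, 5.830)
print(f"{aic_model_1:.4g}  {aic_model_4:.4g}")
```

Lower is better, so Models (4) and (5) are preferred, although the difference between them is small.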

 

 











Model 1 Plots











Model 2 Plots











Model 3 Plots











Model 4 Plots

Assumptions of the Gauss-Markov Theorem

  1. Linear model

                        PASS: This is a weak assumption

  2. Random sampling - independent and identically distributed data

                        PASS: Assumption based on prior knowledge of the data gathering process

  3. No perfect multicollinearity

                        PASS: Based on the correlation matrix

  4. Zero conditional mean (exogeneity) - needed for an unbiased estimator

                        PASS: Based on the plot below

  5. Homoscedasticity

                        FAIL: The band of residuals is not of uniform thickness, which is evidence of heteroscedasticity, so we will use heteroscedasticity-robust standard errors

  6. Normality of residuals

                        FAIL: There is a slight upward tick on the Q-Q plot; however, with the large sample size we can rely on OLS asymptotics
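As an illustration of the remedy named under homoscedasticity, the sketch below compares the classical standard error of a slope with White's heteroscedasticity-robust (HC0) version for a one-regressor model. The slides do not say which robust variant was used, so HC0 is an assumption; the data and function name are hypothetical.

```python
import math

def slope_with_standard_errors(x, y):
    """OLS slope of y on x with classical and HC0 (White) standard errors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Classical: one common error variance, estimated with n - 2 degrees of freedom.
    classical = math.sqrt(sum(e ** 2 for e in resid) / (n - 2) / sxx)
    # HC0: each observation contributes its own squared residual.
    robust_hc0 = math.sqrt(sum(((xi - mean_x) * e) ** 2 for xi, e in zip(x, resid))) / sxx
    return slope, classical, robust_hc0

# Hypothetical toy data, not survey data.
slope, se_classical, se_robust = slope_with_standard_errors(
    [1, 2, 3, 4, 5], [1.0, 2.5, 2.5, 5.0, 4.0]
)
```

The point estimate is unchanged; only the standard errors, and therefore the t statistics and significance stars, are affected.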





















Model 5 Plots











Predicting Health

Linear Models Predicting Health
Dependent variable: Health

                      (1)               (2)               (3)               (4)
Log_Exercise_Freq     -0.351***                                             -0.297***
                      (0.013)                                               (0.013)
BMI                                     0.054***                            0.049***
                                        (0.002)                             (0.002)
Fast_Food_Freq                                            -0.022***         -0.031***
                                                          (0.005)           (0.004)
Constant              2.840***          0.996***          2.533***          1.484***
                      (0.017)           (0.048)           (0.014)           (0.052)
Observations          10,221            10,221            10,221            10,221
R^2                   0.075             0.099             0.002             0.155
Adjusted R^2          0.074             0.099             0.002             0.155
Residual Std. Error   1.019             1.006             1.058             0.974
                      (df = 10219)      (df = 10219)      (df = 10219)      (df = 10217)
F Statistic           823.163***        1,124.116***      20.831***         625.544***
                      (df = 1; 10219)   (df = 1; 10219)   (df = 1; 10219)   (df = 3; 10217)
Note: *p<0.05; **p<0.01; ***p<0.001

 

 

AIC (1) = 2.94 × 10^4; AIC (2) = 2.912 × 10^4; AIC (3) = 3.017 × 10^4; AIC (4) = 2.847 × 10^4

 

 











Model 1 Plots











Model 2 Plots











Model 3 Plots











Model 4 Plots











K-Means Clustering

Using the numeric features:

       1. BMI

       2. Health (from ordinal to numeric)

       3. Frequency of exercise

       4. Frequency of fast food

       5. Time spent primary eating

       6. Time spent secondary eating

The clustering was done in a separate Jupyter Notebook.
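For readers without the notebook, here is a minimal sketch of k-means on standardized features. It is not the author's implementation: the farthest-first initialization is a simple deterministic choice made for this example, the function names are hypothetical, and the toy rows use only two of the six features (BMI and exercise frequency).

```python
import math

def standardize(rows):
    """Z-score each column so that no single feature dominates the distance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    sds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
           for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-first initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy rows of (BMI, exercise sessions per week), not survey data.
toy_rows = [(22, 5), (23, 6), (21, 4), (33, 0), (35, 1), (34, 0)]
toy_labels, toy_centroids = kmeans(standardize(toy_rows), k=2)
print(toy_labels)  # [0, 0, 0, 1, 1, 1]
```

Standardizing matters here because eating times in minutes span a far wider numeric range than a five-level health score, and unscaled k-means would cluster almost entirely on the minutes.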











What’s Next?

       1. Feature engineering

       2. Weighted K-means clustering

       3. New algorithms (hierarchical clustering)

       4. New features

       5. More data











Death by Visualization











Exclude Secondary Eating











Exclude Primary Eating











Primary + Secondary